1 Introduction to ChIP-seq

1.1 What is ChIP-seq?

Chromatin immunoprecipitation experiments followed by sequencing (ChIP–seq) detect protein–DNA binding events and chemical modifications of histone proteins (Figure 1).

Comparison of ChIP-seq experimental protocols (adapted from Furey 2012, Nature Reviews).

Comparison of ChIP-seq experimental protocols (adapted from Furey 2012, Nature Reviews).

Profiles obtained from transcription factor (TF) ChIP-seq and histone ChIP-seq are different (Figure 2):

  • TF ChIP-seq detects enrichment at the cis-regulatory element (CRE), where the TFs bind.
  • Histone ChIP-seq detects enrichment at the sorrounding nucleosomes, where histone residues are modified.
The transcription factor and histone modification landscape for inactive and active cis-regulatory elements (CRE; promoters and enhancers) and the corresponding read profiles (from Pundhir et al. 2015, Frontiers in Genetics).

The transcription factor and histone modification landscape for inactive and active cis-regulatory elements (CRE; promoters and enhancers) and the corresponding read profiles (from Pundhir et al. 2015, Frontiers in Genetics).

When sequencing ChIP-seq samples, we usually need a control for the antibody inespecific binding, known as input. We can use as input bulk DNA or DNA immunoprecipitated with IgG.

1.2 Sequencing parameters

You have to consider that when sequencing ChIP-seq, many of the parameters you can choose depend on the specifics of your experiment, your question and your budget! However, to provide you with some general guidelines, this are some general recommendations:

  • Paired or single-end. Usually single-end reads is sufficient for ChIP-seq analysis.
  • Sequencing depth. Between 30M (for narrow peaks) and 60M (for broad marks) sequenced reads should yield good profiles.
  • Read length. 50bp reads are usually sufficient. Could be interesting to make them longer when dealing with broader regions, but not that necessary when regions are narrow.

2 ChIP-seq Workflow

2.1 The dataset

The datasets we are going to use for this workshop are from Hlady et al., 2019 Hepatology, in which they compare several assays in normal liver with cirrotic liver and hepatocellular carcinoma (HCC) samples from human donors.

All raw data from their assays can be accessed via their GEO ID: GSE112221. The Gene Expression Omnibus (GEO) is a public functional genomics data repository that accepts array- and sequence-based data. Each dataset is given a GEO ID that usually starts with GSE. Inside each dataset, you can find independent samples with identifiers prefixed GSM.

In GEO you can download already processed data, such as bed files. However, we here are interested in downloading the raw fastq files so we can do all the processing ourselves. Therefore, for each sample we are interested in, we need to find the SRA ID (SRX, you can find it in the sample page in GEO) and from there, gather the run ID (SRR) which will give us access to the raw data.

Go to the dataset website at GEO (here). Look around and, using what is explained above, find the run ID for the normal liver sample of the H3K27ac ChIP-seq.
  1. Go to the section Samples (55) and click on + More.
  2. Find the sample named NL1_h3k27ac_h3k27ac_chipseq and click on the GEO sample id (GSM3061124).
  3. Find the section named Relations and then click on the SRA ID (SRX3832997)
  4. On the bottom you will see a table named Runs that includes all the runs for that sample.
  5. The run ID for the normal liver H3K27ac sample is SRR6880492

You can download the files directly from the SRA website by clicking on the run ID, but you can also do so from the terminal using the SRA toolkit tool fastq-dump.

With fastq-dump you will be able to download fastq files directly from SRA. Since these samples are paired-end, you need to include the argument --split-3 to send each pair to a different file. --gzip will compress the fastq output to fastq.gz.

# Download fastq files from SRA database
fastq-dump --split-3 --gzip -O your_folder/ SRR6880492 

Here you have a complete list of the samples we are going to use in this workshop and their SRA run IDs:

Sample names SRA run ID Sequenced reads Description
NL1_h3k27ac SRR6880492 19,808,136 H3K27ac ChIP-seq in Normal Liver
Cirr1_h3k27ac SRR6880493 26,149,171 H3K27ac ChIP-seq in Cirrotic Liver (Patient 1)
HCC1_h3k27ac SRR6880494 18,789,828 H3K27ac ChIP-seq in Hepatocellular Carcinoma (Patient 1)
Cirr3_h3k27ac SRR6880495 25,896,625 H3K27ac ChIP-seq in Cirrotic Liver (Patient 3)
HCC3_h3k27ac SRR6880496 21,954,923 H3K27ac ChIP-seq in Hepatocellular Carcinoma (Patient 3)
NL1_input SRR6880507 35,050,237 Input ChIP-seq in Normal Liver
Cirr1_input SRR6880509 24,301,177 Input ChIP-seq in Cirrotic Liver (Patient 1)
HCC1_input SRR6880511 27,006,532 Input ChIP-seq in Hepatocellular Carcinoma (Patient 1)
Cirr3_input SRR6880513 36,198,135 Input ChIP-seq in Cirrotic Liver (Patient 3)
HCC3_input SRR6880515 14,131,132 Input ChIP-seq in Hepatocellular Carcinoma (Patient 3)

2.2 General workflow

A general workflow for ChIP-seq analysis consists on the following steps:

  1. Qualiy control. Check if reads have good overall quality. Trim adapters and reads if approppriate.
  2. Alignment. Align reads to a reference genome to obtain their genomic coordiantes.
  3. Post-processing. Remove duplicates and unwanted chromosomes.
  4. Peak calling. Find regions of enrichment (TF binding sites or nucleosomes with the histone mark of interest).
  5. Visualization. Visualize profile in a genome browser.

2.3 Some technical details

3 Quality control

Before starting any processing steps, one should check the overall quality of the reads obtained from the sequencing facility. To do so, we are going to use FastQC, a quality control tool for high-troughput sequencing data.

fastqc folder_with_fastq/sample_reads.fastq

After checking that the overall quality is good, one can trim reads and/or filter out reads with low overall quality using different tools1.

Now run fastqc for the control liver sample.
fastqc fastq/NL1_H3K27ac.fastq

4 Alignment

We call alignment (or mapping) to the process of comparing reads to a reference genome, scoring the similarity and then assigning most likely genomic coordinates to each read.

4.1 Reference genomes

A reference genome is a database of DNA sequences, assembled as a representative of a species genome. Usually, one can find two versions of the reference genome: NCBI (starts with GRC) and UCSC (hg for human, mm for mice). The difference between the two basically lies in the naming of the chromosomes: NCBI uses numbers (i.e. 1, 2, X) and UCSC numbers plus chr (i.e. chr1, chr2, chrX).

Nowadays, the most used versions of the human genome are the following:

  • hg19/GRCh37. Even though this is not the latest version, is the most broadly used. If you do not specifically need high accuracy when mapping your reads, you should use this, as many tools (such as GREAT) will only work with this version.
  • hg38/GRch38. This is the newest version. Even though it was relased in 2013, many labs still use the previous version.

There are tools to convert coordinates from one build to another, such as liftOver from UCSC.

4.2 Aligners

Aligners are tools specifically designed to map reads to a reference genome using different algorithms. Many aligners are available, but the most used for aligning ChIP-seq data are:

  • Bowtie2. This is the one we will use in this workshop.
  • BWA.

Both tools allow alignment of single and paired end reads, you just have to specify it when running them. Generating an index is a techinque that many aligners use to speed up their running type. You may think about it as a book index: when having an idex it is then easier and faster to find specific parts of the book (i.e. the reference genome). Paper describing indexing methods

The output of the aligner is a SAM/BAM file containing the read sequence and the assigned coordinates respect to the reference genome. The aligner will report the number of aligned reads (which should be >80% of the sequenced reads) and which of those where uniquely mapped or multi-mapping (they match more than one region in the genome). Aligners such as bowtie2 report the best alginment for multi-mapping reads.

Many papers have tried to compare the performance of the different aligners available. However, my advice to you is to look in the literature for papers analyzing data similar to yours and just copy their methods!

4.3 SAM/BAM files

SAM (Sequence Alignment/Map) and BAM (same as SAM but binary, compressed and indexed) are file types used for storing sequence and alignment information.

Can you tell which one of the folloing files is in SAM and which one in BAM format?

File 1:

�BC�W[�Y�gbb&J��“����έ��q�鮙4۷t�
��P�m&=�m�z���� >+���ºB\(���|}�U�}[��U�._4�����T����������������/�WN?p�Bem��l�n��g��`4�.�'�����z������ �t��� !�I�\)��’1�M#0��@
3�>�”�0��/æ�R��P�T�D ��L&���#�1BM������P2��p��`I$�I}�M�D^t�P�h2f�cFɠ�ό�

File 2:

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
  • File 1 is a BAM file. As BAM files are binary and compressed, they are not readable.
  • File 2 is a SAM file. SAM files are easily readable but their size is much bigger than BAM files!

BAM/SAM files consist on two different parts:

  • Header: starts with @XX and contains information on the file itself: how it is ordered, which is the reference genome used, which program you used to align the reads, etc.
  • Body: contains information on the sequences, their genomic coordinates, mapping qualities, etc.

You can find more information on all the fields present in the SAM/BAM files from the Samtools documentation.

Can you tell which lines are the header and which lines are the body of this SAM file?

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
  • The first two lines starting with @HD and @SQ are the header.
    • The first line (HD) is telling you the format version (VN:1.6) and that the file is sorted by coordinates (SO:coordinate).
    • The second line (SQ) contains information on the reference genome used for the alignment: the name of the reference genome (SN:ref) and it length (LN:45).
  • The following lines contain infomation the each one of the reads aligned to the reference genome:
    1. Read name.
    2. Flag. Bitwise information on the read status (mapped, unmapped, paired, unpaired, etc.).
    3. Reference sequence name.
    4. Position in the sequence
    5. Mapping quality (MAPQ)
    6. […]
[…] You can find more information on all the fields present in the SAM/BAM files from the Samtools documentation.

4.4 Hands on! Let’s align this thing!

Step 0: Download and index the reference genome.

Generating an index takes a long time (20 minutes for hg19 using 10 threads). For this reason, we will not be generating the index in this workshop. However, below you can find all the necessary steps to donwload and index your genome with Bowtie2.

Before running an alignment, you will have to download the reference genome you want to use and index it.

For downloading hg19 you can go to UCSC downloads > Human > hg19 > Full data set > hg19.2bit. 2bit is a compression that UCSC uses for making fasta files smaller. You can convert a 2bit file into a fasta file by running twoBitToFasta from UCSC tools.

twoBitToFastq hg19.2bit hg19.fa # Uncompress the file
bowtie2-build hg19.fa hg19 --threads 5 # Create index with Bowtie2

Step 1: Align your reads to the reference genome.

We got our index and we got our reads, so let’s run the alignment! You can specify different parameters for the alignment (check bowtie2 -h to see them all), we are going to use the following:

  • -t: Print the time it takes to perform each step.
  • -x: Index filename prefix (without the trailing .X.bt2).
  • -U: Single-end reads file (if it were paired end you need to specify the first pair file with -1 and the second pair with -2). Fastq files can be gzipped (extension: .gz), so no need to unzip before aligning!
  • -S: Specify file for SAM output.
bowtie2 -t -x index -U single_end_reads.fastq -S output.sam

Step 2: Explore the output of the alignment

  • Percentage of alignment.
  • File size.
  • File format.

5 Post-processing

Once your alignment is done, you usually will have to perform some routine steps to process the the output:

  • The output is usually is a SAM file. You should convert it to BAM, which is a more light-weight way of storing the same data.
  • You should sort your BAM file by genomic coordinates.
  • You can also filter out some reads you do not want in your BAM file: duplicates, reads mapping to ENCODE black-listed regions, chromosomes you are not interested in, etc.
  • You should index your BAM file. As with reference genomes, you can create a index for your BAM file so operations requiring accessing parts of it will run much faster.

Usually, for running operations on BAM or SAM files, the preferred program is Samtools.

5.1 Converting SAM to BAM

This first step can be easily done using samtools view, with the following arguments:

  • -@ 4: Indicates the number (4) of additional threads to use (default is 0).
  • -b: Tells samtools to convert to BAM.
  • -o: Output filename for the BAM file.
samtools view file.sam -@ 4 -b -o file.raw.bam

5.2 Sorting your BAM file by coordinate

Most programs need an input BAM file in which reads are sorted according to their genomic position. To order our BAM file we can use samtools sort with the following parameters:

  • -o: Output filename for the sorted BAM file.
  • -m 2G: Tells Samtools to use up to 2G per thread.
  • -@ 4: Uses 4 __additional__threads.
samtools sort file.raw.bam -o file.srtd.bam -m 2G -@ 4

Do you remember bash pipes? You can use them with samtools too! You can pipe the output of samtools view (a BAM file) to samtools sort adding - , without having to create any intermediate files:

samtools view data/ChIP-seq/BAM/file.sam -b | samtools sort - -o data/ChIP-seq/BAM/file.srtd.bam

5.3 Removing duplicates

This step is usually recommended to avoid PCR amplification bias and also to remove potential biases induced by PCR errors that could then be amplificated. This step can be performed by several tools, but for the sake of simplicity we are going to use samtools markdup, which will remove all reads marked as duplicates (when argument -r is used).

Briefly, samtools markdup checks wether multiple reads have the same coordinates and orientation. When a duplicate is detected, the overall highest quality template is kept and all others have the duplicate flag set (and removed if you use -r)2.

The arguments we will use when running samtools markdup are the following:

  • -r: Remove duplicate reads.
  • -s: Report statistics.
  • -@ 4: Use 4 additional threads.
samtools markdup data/ChIP-seq/BAM/file.srtd.bam data/ChIP-seq/BAM/file.rmdup.bam -r -@ 4 -s

5.4 Indexing your BAM file

As with reference genomes, BAM files need to be indexed so when querying sequences contained in it, we can do it in less amount of time. To do so, we just need to use samtools index. This will generate an additional file with the same name as your input BAM file, but adding the suffix .bai.

samtools index file.rmdup.bam

6 Peak calling

The last step of the processing of ChIP-seq data is the peak calling. In this step, we will run a program that using different algorithms will detect the regions in which there is an enrichment of ChIP-seq reads compared to the background, and will return a list with the genomic positions of this regions (i.e. peaks).

One of the most famous and most used peak callers is MACS2. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions and improves the spatial resolution through combining the information of both sequencing tag position and orientation.

In this step, we will need to use the H3K27ac sample and its input, so we can normalize and remove the inespecific events. Therefore, we will run MACS2 with the following arguments:

  • -f: Indicates the format of the input file, in this case BAM.
  • -t: Treatment file (in this case, H3K27ac).
  • -c: Control or input file.
  • -g: Effective genome size. hs indicates that we are using the human genome.
  • -n: Name for the output files (without any suffixes).
  • --tempdir: Directory for saving temporary files.
  • --broad: Call broad peaks. Since we are analyzing histone marks, it will try to join consecutive peaks corresponding to nucleosomes sorrounding a CRE.
  • --nomodel: Avoid estimation of fragment size.
activate-macs-git-2017.5.15 # Activate MACS2

mkdir tmp/ # Create folder for temporary files

macs2 callpeak -f BAM -t file.rmdup.bam -c file_input.rmdup.bam -g hs -n file --tempdir tmp/ --broad --nomodel

7 Visualization

An important step on the processing of ChIP-seq data, which can be done at any point is visualizing your data. You can visualize all files generated from the alignment forward using a genome browser. We will see some examples using IGV.

This are some example file types one can visualize in a genome browser:

  • BAM files. It is mandatory for the index to be present in the same folder as the BAM file. When you load a BAM file into IGV, you will see that it loads two different tracks: a track with all reads represented (missmatches between read and reference will be highlighted) and a coverage track, which is the sum of all reads aligning to a specific region.
  • Peak files (BED, broadPeak) will be represented as rectangles with their genomic coordinates.
  • bigWig files. This type of file is approppriate when your interested in plotting the coverage of your track from a lightweight file, without any information regarding read sequence or quality. bigWig files can be generated from BAM files converting them to bedGraph (same as bigWig but without compression) and then to bigWig.

The steps to generate a bigWig file are as follows:

  1. Convert BAM to bedGraph using genomeCoverageBed from bedtools.
genomeCoverageBed -bg -scale NUM -ibam file.rmdup.bam -g chr_sizes > file.bdg
  1. Convert bedGraph to bigWig.
bedGraphToBigWig file.bdg chr_sizes file.bw -unc

  1. We will not cover this in this workshop but if you are interested in this, you should check AfterQC

  2. More information on samtools markdup here